IntroductionΒΆ
Companies invest significant resources in marketing campaigns to convert prospective customers into current ones. This process is essential for maintaining revenue levels, reinvesting to improve service quality, responding to shareholders, and more. Therefore, understanding why customers leave and predicting churn with enough anticipation to implement a retention plan is an integral part of business operations. With this in mind, this kernel will focus on exploring the available data to identify potential causes of customer churn, laying the groundwork for future machine learning models.
Instead of using Matplotlib or Seaborn, I chose Plotly for this kernel due to its excellent interactivity. Additionally, Plotly, when combined with Dash, provides tools to build web applications with interactive dashboards.
I hope you find this kernel helpful. If you have any questions or suggestions for improvement, please feel free to open an issue or start a discussion.
RequirementsΒΆ
- Pandas version: 1.5.3
- NumPy version: 1.24.4
- SciPy version: 1.10.1
- Matplotlib version: 3.7.0
- Missingno version: 0.5.2
- Plotly version: 5.13.1
Import Libraries librariesΒΆ
Utility functionsΒΆ
resumetable() | expanded_describe():ΒΆ
spot_missing_values():ΒΆ
reshape_df()ΒΆ
custom_bar_chart()ΒΆ
custome_histograms()ΒΆ
costum_boxplot()ΒΆ
custom_line_chart()ΒΆ
serv_dist_in_num_servs()ΒΆ
Understanding the dataΒΆ
Loading datasetΒΆ
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 gender 7043 non-null object 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null object 4 Dependents 7043 non-null object 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null object 7 MultipleLines 7043 non-null object 8 InternetService 7043 non-null object 9 OnlineSecurity 7043 non-null object 10 OnlineBackup 7043 non-null object 11 DeviceProtection 7043 non-null object 12 TechSupport 7043 non-null object 13 StreamingTV 7043 non-null object 14 StreamingMovies 7043 non-null object 15 Contract 7043 non-null object 16 PaperlessBilling 7043 non-null object 17 PaymentMethod 7043 non-null object 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null object 20 Churn 7043 non-null object dtypes: float64(1), int64(2), object(18) memory usage: 1.1+ MB
Quick view to the dataset.ΒΆ
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3120 | 3148-AOIQT | Female | 0 | Yes | No | 69 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Yes | Credit card (automatic) | 24.95 | 1718.35 | No |
| 3644 | 8219-VYBVI | Male | 0 | No | Yes | 39 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.00 | 1004.35 | No |
| 2429 | 8406-LNMHF | Male | 1 | Yes | No | 59 | Yes | No | Fiber optic | Yes | Yes | Yes | No | No | No | One year | Yes | Credit card (automatic) | 82.95 | 4903.15 | No |
General Info about Dataset:ΒΆ
More info could be found in the following link:
Column descriptionΒΆ
Demographic info:ΒΆ
- gender
- SeniorCitizen: Indicates if the customer is under 65 or older: Yes, No
- Partner
- Dependents
Company services:ΒΆ
- PhoneService,
- Multiple lines,
- Internet service,
- Online security,
- Online backup,
- Device protection,
- Tech support,
- Streaming TV,
- Streaming movies
Customer account information:ΒΆ
- CustomerID
- Contract
- Ternure
- Paperless billing
- Payment method
- Monthly charges
- Total charges
Additional considerations:ΒΆ
Column tenure indicates the months that a particular customer has been with the company by the end of the quarter (The fiscal quarter that the data has been derived from (e.g. Q3)). I will assume that all the customer cancelation dates are the same
| column_name | dtypes | unique2 | count | top2 | freq2 | mean | std | min | 25% | 50% | 75% | max | First Value | Second Value | Third Value | Entropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | customerID | object | 7043 | 7043.0 | 0002-ORFBO | 1 | - | - | - | - | - | - | - | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 3.85 |
| 1 | gender | object | 2 | 7043.0 | Male | 3555 | - | - | - | - | - | - | - | Female | Male | Male | 0.30 |
| 2 | SeniorCitizen | int64 | 2 | 7043.0 | 0.0 | 5901 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0 | 0.19 |
| 3 | Partner | object | 2 | 7043.0 | No | 3641 | - | - | - | - | - | - | - | Yes | No | No | 0.30 |
| 4 | Dependents | object | 2 | 7043.0 | No | 4933 | - | - | - | - | - | - | - | No | No | No | 0.27 |
| 5 | tenure | int64 | 73 | 7043.0 | 1.0 | 613 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 | 1 | 34 | 2 | 1.78 |
| 6 | PhoneService | object | 2 | 7043.0 | Yes | 6361 | - | - | - | - | - | - | - | No | Yes | Yes | 0.14 |
| 7 | MultipleLines | object | 3 | 7043.0 | No | 3390 | - | - | - | - | - | - | - | No phone service | No | No | 0.41 |
| 8 | InternetService | object | 3 | 7043.0 | Fiber optic | 3096 | - | - | - | - | - | - | - | DSL | DSL | DSL | 0.46 |
| 9 | OnlineSecurity | object | 3 | 7043.0 | No | 3498 | - | - | - | - | - | - | - | No | Yes | Yes | 0.45 |
| 10 | OnlineBackup | object | 3 | 7043.0 | No | 3088 | - | - | - | - | - | - | - | Yes | No | Yes | 0.46 |
| 11 | DeviceProtection | object | 3 | 7043.0 | No | 3095 | - | - | - | - | - | - | - | No | Yes | No | 0.46 |
| 12 | TechSupport | object | 3 | 7043.0 | No | 3473 | - | - | - | - | - | - | - | No | No | No | 0.45 |
| 13 | StreamingTV | object | 3 | 7043.0 | No | 2810 | - | - | - | - | - | - | - | No | No | No | 0.46 |
| 14 | StreamingMovies | object | 3 | 7043.0 | No | 2785 | - | - | - | - | - | - | - | No | No | No | 0.46 |
| 15 | Contract | object | 3 | 7043.0 | Month-to-month | 3875 | - | - | - | - | - | - | - | Month-to-month | One year | Month-to-month | 0.43 |
| 16 | PaperlessBilling | object | 2 | 7043.0 | Yes | 4171 | - | - | - | - | - | - | - | Yes | No | Yes | 0.29 |
| 17 | PaymentMethod | object | 4 | 7043.0 | Electronic check | 2365 | - | - | - | - | - | - | - | Electronic check | Mailed check | Mailed check | 0.59 |
| 18 | MonthlyCharges | float64 | 1585 | 7043.0 | 20.05 | 61 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 | 29.85 | 56.95 | 53.85 | 3.02 |
| 19 | TotalCharges | object | 6531 | 7043.0 | 11 | - | - | - | - | - | - | - | 29.85 | 1889.5 | 108.15 | 3.80 | |
| 20 | Churn | object | 2 | 7043.0 | No | 5174 | - | - | - | - | - | - | - | No | No | Yes | 0.25 |
Insights:ΒΆ
- customerID is unique, which means that we have data from 7043 different customers.
- TotalCharges, most of its values are unique. It has a wrong data types, and the top frequent value appear empty (' ').
- Our dataset has 21 columns and 7043 rows.
Spotting missing valuesΒΆ
| column_name | dtypes | isnull | value: ' ' | value: 'No apply' | value: 'empty' | total_missing | percent_missing | |
|---|---|---|---|---|---|---|---|---|
| 0 | customerID | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 1 | gender | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 2 | SeniorCitizen | int64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 3 | Partner | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 4 | Dependents | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 5 | tenure | int64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 6 | PhoneService | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 7 | MultipleLines | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 8 | InternetService | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 9 | OnlineSecurity | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 10 | OnlineBackup | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 11 | DeviceProtection | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 12 | TechSupport | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 13 | StreamingTV | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 14 | StreamingMovies | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 15 | Contract | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 16 | PaperlessBilling | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 17 | PaymentMethod | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 18 | MonthlyCharges | float64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 19 | TotalCharges | object | 0 | 11.0 | 0.0 | 0.0 | 11.0 | 0.16% |
| 20 | Churn | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
Insights:ΒΆ
- Even though there is not missing values such as np.nan, there are strings values such as ' ' that indicate missing values.
- According to the previous table, the only column with missing values is TotalCharges. Since it represents 0.16% of the data, the plan of action will be to drop those 11 rows.
- customerID could be dropped. At this stage, it seems that the id code nomenclature does not hold any hidden information.
| column_name | dtypes | isnull | value: ' ' | value: 'No apply' | value: 'empty' | total_missing | percent_missing | |
|---|---|---|---|---|---|---|---|---|
| 0 | gender | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 1 | SeniorCitizen | int64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 2 | Partner | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 3 | Dependents | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 4 | tenure | int64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 5 | PhoneService | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 6 | MultipleLines | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 7 | InternetService | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 8 | OnlineSecurity | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 9 | OnlineBackup | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 10 | DeviceProtection | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 11 | TechSupport | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 12 | StreamingTV | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 13 | StreamingMovies | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 14 | Contract | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 15 | PaperlessBilling | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 16 | PaymentMethod | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 17 | MonthlyCharges | float64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 18 | TotalCharges | float64 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
| 19 | Churn | object | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00% |
InsightsΒΆ
- Previous tables shows that there is not missing values. It possible to proceed with data visualization.
Data VisualizationΒΆ
InsigthsΒΆ
- The dependent variable is Churn. In general, 26.6% of customer churn, while the company was able to retain 73.4% of the customers.
- Since the objective is to predict customer churn, this dataset is unbalanced.
VIZ - Demographic evaluationΒΆ
InsightsΒΆ
- Sex distribution among male and female are similar. It is not showing a clear difference between churn and retained customers.
- Curiously, customers over 65 years old are more likely to churn (around 42% of churn rate). It could be related to age issues.
- In general, the number of customers with or without a partner is very similar. On the other hand, most customers do not have dependents (around 70% of them). The interesting point is that customers without a partner or dependents are more likely to churn (approximately 33% and 31%, respectively)
VIZ - Service evaluationΒΆ
InsightsΒΆ
- Even though most customers have phone service (around 90%), customer with or without service have a similar rate churn (around 25% to 27%).
- Customers with multiline phone services are slightly more likely to churn, around 29%, compared to the 25% churn rate for customers without multiline services.
- Around 78% of customer has internet service (fiber optic and DSL). Interestingly, customers with Fiber optic service are more likely to churn, with approximately a churn rate of 42%.
- Figures show that customers who do not complement their internet service with online security, online backup, device protection, or tech support are more likely to churn. The churn rate for those customers ranges between 39% to 42%. Additionally, the churn rate for customers without or with services such as streaming tv and streaming movies are similar, which is between 30% to 34% churn.
VIZ - Customer account informationΒΆ
InsightsΒΆ
Turning to customer service information, 55% of customers have a month-to-month contract with a high churn rate of approximately 43%, which is relatively high compared to 11% churn of customers with one-year contracts or the roughly 3% churn for customers with two-year contracts.
Including tenure in the analysis or contract and churn, the histogram shows that the greater the customer commitment, the later the customer service cancelation is. Around 62% of the month-to-month customers churn within the first year. In contrast to month-to-month customers, customers with one-year or two-year contracts seem to delay their churn. Around 70% of One-year contract customers cancel their service after three years, and 65% of two-year contract customers cancel their service after five years.
In general, comparing the churn rate per month, we can see that the more time the client is in the company, the lower the customer attrition will be.
Interestingly, 59% of customers prefer paperless billing; however, in terms of churn, paperless billing represents a 33% churn compared to other forms of billing that represent 16% churn.
Slightly over a third of the customers use electronic checks as preferred payment methods; however, it has a 45% churn. Other payment methods (such as mailed checks, automatic payment with bank transfer, and credit cards) present a churn rate of around 15% to 19%.
Around 62% of customers cancel their services the first month, representing 380 customers. After the first three months, the churn rate changes to be lower than the retention rate of customers, and for each month that the company keeps the client, the company can reduce the churn rate up to around 2% (customer with six years).
Additionally, according to the boxplot, it seems that customers with higher monthly charges are more likely to churn than those who have to pay lower monthly bills. This difference is accentuated for those customers with a longer time in the company.
As was expected, the total charges per customer increased as the customer tenure increased. However, it is interesting that customers with higher total charges are more likely to churn than customers with lower total charges.
Counting multiline as a service, each customer can add up to nine services. According to the bar chart, with around a 45% churn rate, customers with three services are more likely to churn.
If we give a deeper look at the service distribution of active services for customers who have selected 3 or 4 services, we can see that the churn rate for customers with Internet service (Fiber optic) reaches up to 59% and 51%, respectively.
Note: For the following steps (data mining models), I will keep the new column 'num_services'.